Overview

In this homework assignment, we will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. We will build multiple linear regression models on the continuous variable TARGET_AMT to predict the cost of a crash, and binary logistic regression models on the boolean variable TARGET_FLAG to predict the probability that a person will crash their car.

We are going to build several models using different techniques and variable selection. In order to best assess our predictive models, we will create a validation set within our training data along an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that does not contain the target.

1. Data Exploration

The insurance training dataset contains 8161 observations of 26 variables; each record represents a customer at an auto insurance company. The evaluation dataset contains 2141 observations of 26 variables. These include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car model, age and assessed value.

Each record also has two response variables. The first response variable, TARGET_FLAG, is a boolean where “1” means that the person was in a car crash. The second response variable, TARGET_AMT, is a numeric value indicating the (positive) cost if a car crash occurred; this value is zero if the person did not crash their car.

We can explore a sample of the training data here, and make some initial observations:

  • Some of the variables are character though they should be numeric and vice-versa.
  • Some currency variables are strings with ‘$’ symbols instead of numerics.
  • Some character variables include a prefix z_ that could be removed for readability.

1.1 Summary Statistics

The table below provides valuable descriptive statistics about the training data:

Data summary
Name train_df
Number of rows 8161
Number of columns 25
_______________________
Column type frequency:
character 10
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PARENT1 0 1 2 3 0 2 0
MSTATUS 0 1 2 3 0 2 0
SEX 0 1 1 1 0 2 0
EDUCATION 0 1 3 12 0 5 0
JOB 0 1 6 12 0 9 0
CAR_USE 0 1 7 10 0 2 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 19 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0
INCOME 445 0.95 61898.09 47572.68 0 28097 54028 85986 367030.0
HOME_VAL 464 0.94 154867.29 129123.77 0 0 161160 238724 885282.0
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0
BLUEBOOK 0 1.00 15709.90 8419.73 1500 9280 14440 20850 69740.0
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0
OLDCLAIM 0 1.00 4037.08 8777.14 0 0 0 4636 57037.0
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0

Based on this summary table and exploration of the data, we can make the following observations:

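  • In the raw data, 14 variables are character and 12 are numeric; the summary above shows 10 character and 15 numeric columns, which is consistent with the four currency columns having been converted to numeric and INDEX dropped.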
  • There is no missing data for character variables.
  • Numeric variables with missing values include YOJ (6%), INCOME (5%), HOME_VAL (6%), CAR_AGE (6%), and AGE (6 records, under 0.1%).
  • Most of the numeric variables have a minimum of zero.
  • One variable, CAR_AGE, has a negative minimum of -3, which doesn’t make intuitive sense.

1.2 Distributions

Before building a model, we need to check whether both classes are equally represented in our TARGET_FLAG variable. Class 1 accounts for 26% and class 0 for 74% of the target variable. As a result, we have an unbalanced class distribution for our target variable, and we need to take some additional steps (bootstrapping, etc.) before using logistic regression.

Distribution of Target Flag
Value %
0 0.74
1 0.26
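
The class shares in the table above can be reproduced with a one-line tabulation. The following is a minimal sketch in base R; the vector target_flag is a hypothetical stand-in built to match the reported proportions, not the actual column from train_df.

```r
# Hypothetical stand-in for TARGET_FLAG, built to match the reported shares
target_flag <- c(rep(0, 74), rep(1, 26))

# Share of each class, rounded to two decimals
class_share <- round(prop.table(table(target_flag)), 2)
print(class_share)
```

On the real data the same call would be applied to train_df$TARGET_FLAG.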

Many of these distributions seem highly skewed and non-normal. As part of our data preparation we’ll use power transformations to find whether transforming variables to more normal distributions improves our models’ efficacy.

1.3 Box Plots

Commentary

1.4 Scatter Plot

Interestingly, none of our predictors appear to have strong linear relationships to our TARGET_AMT response variable, which is a primary assumption of linear regression. This suggests that alternative methods might be more successful in modeling the relationships.

1.5 Correlation Matrix

Commentary

2. Data Preparation

2.1 Data types

In order to work with our training dataset, we’ll need to first convert some variables to more useful data types:

  • Convert currency columns from character to integer: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM.
  • Convert character columns to factors: TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.
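
A minimal sketch of these conversions in base R follows. The data frame raw and the helper to_number() are hypothetical names introduced for illustration, and the sample values are invented; only the column names come from the dataset.

```r
# Hypothetical sample of raw values (invented for illustration)
raw <- data.frame(
  INCOME   = c("$67,349", "$91,449", ""),
  BLUEBOOK = c("$14,230", "$14,940", "$4,010"),
  CAR_TYPE = c("Minivan", "z_SUV", "Sports Car"),
  stringsAsFactors = FALSE
)

# Strip '$' and ',' and convert to numeric; empty strings become NA
to_number <- function(x) {
  cleaned <- gsub("[$,]", "", x)
  cleaned[!nzchar(cleaned)] <- NA
  as.numeric(cleaned)
}

raw$INCOME   <- to_number(raw$INCOME)
raw$BLUEBOOK <- to_number(raw$BLUEBOOK)

# Remove the "z_" prefix, then convert the character column to a factor
raw$CAR_TYPE <- factor(sub("^z_", "", raw$CAR_TYPE))
```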

2.3 Transformations and Missing Values

Before we go further, we need to identify and handle any missing, NA or negative data values so we can perform log transformations and regression.

First, we’ll apply transformations to clean up and align formatting of our variables:

  • Drop the INDEX variable.
  • Remove “z_” from all character class values.
  • Update RED_CAR, replace [no,yes] values with [No, Yes] values.
  • Replace JOB blank values with ‘Unknown’.

Next, we’ll manually adjust two special cases of missing or outlier values.

  • In cases where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.
  • There is also at least one value of CAR_AGE that is less than zero - we’ll assume this is a data collection error and set it to zero (representing a brand-new car.)
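
The two manual adjustments can be sketched in base R as below. The data frame toy is a hypothetical four-row example; the rules applied are the ones stated in the two bullets above.

```r
# Hypothetical four-row example
toy <- data.frame(
  YOJ     = c(0, 12, 0, 7),
  INCOME  = c(NA, 52000, 0, NA),
  CAR_AGE = c(-3, 1, 8, 12)
)

# Zero years on the job with unknown income: treat as non-employment
not_employed <- !is.na(toy$YOJ) & toy$YOJ == 0 & is.na(toy$INCOME)
toy$INCOME[not_employed] <- 0

# Negative car age is assumed to be a data collection error; set to zero
toy$CAR_AGE[!is.na(toy$CAR_AGE) & toy$CAR_AGE < 0] <- 0
```

Note that the fourth row keeps its missing INCOME because YOJ is not zero, so it is left for imputation.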

We’ll use MICE to impute our remaining variables with missing values - AGE, YOJ, CAR_AGE, INCOME and HOME_VAL:

  • We might reasonably assume that relationships exist between these variables (older, more years on the job may correlate with higher income and home value). Taking simple means or medians might suppress those features, but MICE should provide a better imputation.
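
As an illustration of the imputation step, here is a hedged sketch using the mice package on simulated data. The settings shown (predictive mean matching, a single imputed dataset, the seed) are assumptions made for the example, since the report does not state the ones actually used; the block is guarded so it still runs when mice is not installed.

```r
set.seed(42)
n <- 200

# Simulated stand-ins for three of the numeric columns
toy <- data.frame(
  AGE    = round(rnorm(n, 45, 9)),
  YOJ    = round(runif(n, 0, 20)),
  INCOME = round(rlnorm(n, 11, 0.5))
)

# Remove 5% of two columns to mimic the missingness seen in the summary
toy$YOJ[sample(n, 10)]    <- NA
toy$INCOME[sample(n, 10)] <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  # Predictive mean matching; m = 1 keeps the sketch short (assumed settings)
  imp     <- mice::mice(toy, m = 1, method = "pmm", seed = 42, printFlag = FALSE)
  toy_imp <- mice::complete(imp, 1)
} else {
  toy_imp <- toy  # mice not installed: data left unchanged
}

n_missing_after <- sum(is.na(toy_imp))
```

Because each variable is imputed from the others, relationships such as the one between YOJ and INCOME are used rather than flattened, which is the motivation given above.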

Next we’ll want to consider any power transformations for variables that have skewed distributions. For example, our numeric response variable TARGET_AMT is a good candidate for transformation as its distribution is very highly skewed, and linear regression assumes normally distributed residuals, which such a skewed response makes unlikely.

  • Log transformation will be applied to variables INCOME, TARGET_AMT, OLDCLAIM to transform their distributions from right-skewed to normally distributed.
  • Similarly, BoxCox transformation will be applied to variables BLUEBOOK, TRAVTIME, TIF, so they also are more normally distributed.
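
The two transformations can be sketched in base R as follows. The 0.01 offset in log_offset() is an inference from the transformed summary further down, where the minimum of TARGET_AMT is -4.61, which equals log(0.01); the lambda passed to box_cox() is a hypothetical value, as the report does not list the estimated lambdas. Both function names are introduced here for illustration.

```r
# Log transform with a small offset so that zeros stay finite.
# The 0.01 offset is inferred from the reported minimum of -4.61.
log_offset <- function(x, offset = 0.01) log(x + offset)

# One-parameter Box-Cox transform for strictly positive x
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Minimum, 75th percentile and maximum of the raw TARGET_AMT
target_amt <- c(0, 1036, 107586.1)
print(round(log_offset(target_amt), 2))

# lambda = 0.5 is a hypothetical value chosen for the example
travtime <- c(5, 33, 142)
print(box_cox(travtime, lambda = 0.5))
```

Applied to the raw quantiles of TARGET_AMT, the log transform gives values that round to -4.61, 6.94 and 11.59, in line with the transformed summary table.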

To give our models more variables to work with, we’ll engineer some additional features:

  • Create bin values for CAR_AGE, HOME_VAL and TIF.
  • Create dummy variables for two-level factors, MALE, MARRIED, LIC_REVOKED, CAR_RED, PRIVATE_USE, SINGLE_PARENT and URBAN.
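
A minimal sketch of both kinds of engineered feature in base R. Quartile-based bins labelled 1 to 4 are an assumption (the report does not state the binning rule, only that the bin variables range from 1 to 4), and the data frame toy is invented for the example.

```r
# Hypothetical eight-row example
toy <- data.frame(
  CAR_AGE = c(0, 1, 8, 12, 28, 5, 3, 15),
  SEX     = c("M", "F", "F", "M", "F", "M", "F", "F")
)

# Quartile-based bins labelled 1 to 4 (assumed binning rule)
breaks <- quantile(toy$CAR_AGE, probs = seq(0, 1, 0.25))
toy$CAR_AGE_BIN <- as.integer(
  cut(toy$CAR_AGE, breaks = breaks, include.lowest = TRUE)
)

# Two-level character variable recoded as a 0/1 dummy factor
toy$MALE <- factor(as.integer(toy$SEX == "M"))
```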

We can examine our final, transformed training dataset and distributions below (with a temporary numeric variable CAR_CRASH to represent the response variable for visualization purposes.)

Data summary
Name train_df
Number of rows 8161
Number of columns 28
_______________________
Column type frequency:
factor 10
numeric 18
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
EDUCATION 0 1 FALSE 5 Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 Blu: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_TYPE 0 1 FALSE 6 SUV: 2294, Min: 2145, Pic: 1389, Spo: 907
MALE 0 1 FALSE 2 0: 4375, 1: 3786
MARRIED 0 1 FALSE 2 1: 4894, 0: 3267
LIC_REVOKED 0 1 FALSE 2 0: 7161, 1: 1000
CAR_RED 0 1 FALSE 2 0: 5783, 1: 2378
PRIVATE_USE 0 1 FALSE 2 1: 5132, 0: 3029
SINGLE_PARENT 0 1 FALSE 2 0: 7084, 1: 1077
URBAN 0 1 FALSE 2 1: 6492, 0: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_FLAG 0 1 0.26 0.44 0.00 0.00 0.00 1.00 1.00
TARGET_AMT 0 1 -1.21 5.69 -4.61 -4.61 -4.61 6.94 11.59
KIDSDRIV 0 1 0.17 0.51 0.00 0.00 0.00 0.00 4.00
AGE 0 1 44.78 8.63 16.00 39.00 45.00 51.00 81.00
HOMEKIDS 0 1 0.72 1.12 0.00 0.00 0.00 1.00 5.00
YOJ 0 1 10.48 4.10 0.00 9.00 11.00 13.00 23.00
INCOME 0 1 61461.41 47434.08 0.00 27684.00 53483.00 85479.00 367030.00
HOME_VAL 0 1 154644.85 129340.41 0.00 0.00 160874.00 238349.00 885282.00
TRAVTIME 0 1 15.08 5.82 3.00 11.17 15.34 19.12 45.61
BLUEBOOK 0 1 182.32 48.72 62.21 147.95 182.18 216.49 381.01
TIF 0 1 1.63 1.25 0.00 0.00 1.62 2.43 4.70
OLDCLAIM 0 1 0.56 6.54 -4.61 -4.61 -4.61 8.44 10.95
CLM_FREQ 0 1 0.80 1.16 0.00 0.00 0.00 2.00 5.00
MVR_PTS 0 1 1.70 2.15 0.00 0.00 1.00 3.00 13.00
CAR_AGE 0 1 8.34 5.70 0.00 1.00 8.00 12.00 28.00
CAR_AGE_BIN 0 1 2.48 1.12 1.00 1.00 2.00 3.00 4.00
HOME_VAL_BIN 0 1 2.45 1.16 1.00 1.00 2.00 3.00 4.00
TIF_BIN 0 1 2.41 1.16 1.00 1.00 2.00 3.00 4.00

2.4 Visualizations

We can use Mosaic Plots to illustrate the relationship of binary factor variables to TARGET_FLAG:

  • Observation
  • Observation

We can also use Mosaic Plots to illustrate the relationship of multi-level factor variables to TARGET_FLAG:

  • Observation
  • Observation

2.5 Training and Validation Sets

To proceed with modeling, we’ll split our training data into train (80%) and validation (20%) datasets.
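
One simple way to produce such a split is random sampling of row indices, sketched below in base R. The seed is arbitrary and the use of sample() is an assumption; the report does not say which function was used.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

n_obs     <- 8161
train_idx <- sample(seq_len(n_obs), size = floor(0.8 * n_obs))
valid_idx <- setdiff(seq_len(n_obs), train_idx)

print(c(train = length(train_idx), validation = length(valid_idx)))
```

With 8161 rows this gives 6528 training and 1633 validation observations; 1633 is also the total shown in the confusion matrix in section 4.1.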

3. Multiple Linear Regression

We’ll use Multiple Linear Regression to model the TARGET_AMT response variable, the estimated cost of a crash for a given observation.

3.1 Model 1 - Lasso Regression

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “mse” - The type.measure is set to minimize the Mean Squared Error for the model.
  • nfolds = 10 - Given the size of the dataset we defaulted to 10-fold cross-validation.
  • family = gaussian - For Linear Regression
  • alpha = 1 - The alpha value of 1 sets the variable shrinkage method to lasso.
  • weights = a weight of 0.2638 / n for observations with a TARGET_AMT of 0 and 0.7362 / n for observations with any other value of TARGET_AMT.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the predictor variables around a mean of zero and a standard deviation of one before modeling.
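
The settings above can be put together as in the sketch below. The predictor matrix X, response Y and indicator crashed are simulated purely for illustration; the weights follow the rule described in the list, and the cv.glmnet() arguments mirror those listed. The fit is guarded so the block still runs when glmnet is not installed.

```r
set.seed(1)
n <- 200

# Simulated predictors and a zero-inflated, log-scale response
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("x", 1:5)))
crashed <- rbinom(n, 1, 0.26)
Y <- ifelse(crashed == 1, 7 + X[, 1] + rnorm(n), log(0.01))

# Observation weights: 0.2638 / n for no-crash rows, 0.7362 / n otherwise
weights_lm <- ifelse(crashed == 0, 0.2638 / n, 0.7362 / n)

if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_fit <- glmnet::cv.glmnet(
    x = X, y = Y, weights = weights_lm,
    type.measure = "mse", nfolds = 10,
    family = "gaussian", standardize = TRUE, alpha = 1
  )
  # The two lambda values discussed in the text
  print(c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se))
}
```

The weights give the minority (crash) rows more influence on the fit, which is one way of compensating for the class imbalance noted in section 1.2.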

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

The coefficients extracted using lambda.min minimize the mean cross-validated error. The resulting model includes 35 non-zero coefficients and has an AICc of 65.83. The coefficients extracted using lambda.1se produce the most regularized model whose cross-validated error is within one standard error of the minimum. This model has 27 non-zero coefficients and an AICc of 49.91.

The coefficients extracted using lambda.1se result in the lowest AICc (the best trade-off between fit and complexity) with fewer predictor variables.

## 
## Call:  cv.glmnet(x = X, y = Y, weights = weights_lm, type.measure = "mse",      nfolds = 10, family = "gaussian", standardize = TRUE, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure     SE Nonzero
## min 0.02281    48   30.37 0.5402      35
## 1se 0.16092    27   30.85 0.4708      27

## $mse
## lambda.min 
##    30.2152 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##   4.756945 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] 65.82961
## 
## $BIC
## [1] 302.8764
## $AICc
## [1] 49.90845
## 
## $BIC
## [1] 232.8399

A closer look at the 35 non-zero coefficients (excluding the intercept) at the selected lambda value, lambda.min (0.023), shows that the URBANICITY Highly Urban/ Urban predictor variable has the largest impact on the response variable TARGET_AMT.

In the lasso model, the coefficient for URBANICITY Highly Urban/ Urban home/work area is the biggest contributor to the cost estimate of a car crash, at roughly twice the magnitude of the next-largest coefficient.

Reviewing the top six predictor variables that impact the cost associated with an accident:

  • URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the expected cost associated with a crash
  • JOB Doctor - being a doctor reduces the expected costs associated with a crash
  • JOB Manager - being a manager reduces the expected costs associated with a crash
  • CAR_TYPE Sports Car - owning a sports car increases the expected costs associated with a crash
  • CAR_USE Private - using a car for private activities reduces the expected costs associated with a crash
  • REVOKED Yes - a history of having a revoked license increases the expected costs associated with a crash

Some of the notable coefficients that drop out of the model include:

  • HOME_VAL
  • JOB Professional
  • CAR_AGE
  • CAR_RED
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)           4.659086e-01
## KIDSDRIV              9.482740e-01
## AGE                  -6.302475e-03
## HOMEKIDS              7.138829e-02
## YOJ                  -9.512408e-03
## INCOME               -9.310518e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -8.229800e-01
## EDUCATIONHigh School  2.545373e-02
## EDUCATIONMasters     -7.044465e-01
## EDUCATIONPhD         -2.771949e-01
## JOBClerical           2.184793e-01
## JOBDoctor            -2.371073e+00
## JOBHome Maker        -5.261066e-02
## JOBLawyer            -4.410511e-01
## JOBManager           -2.107991e+00
## JOBProfessional       .           
## JOBStudent           -9.671921e-02
## JOBUnknown           -5.052147e-01
## TRAVTIME              8.625996e-02
## BLUEBOOK             -9.206580e-03
## TIF                  -2.545276e-01
## CAR_TYPEPanel Truck   5.280531e-01
## CAR_TYPEPickup        7.602791e-01
## CAR_TYPESports Car    1.940375e+00
## CAR_TYPESUV           1.223260e+00
## CAR_TYPEVan           1.044559e+00
## OLDCLAIM              7.704577e-02
## CLM_FREQ              8.117799e-02
## MVR_PTS               2.366235e-01
## CAR_AGE               .           
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -2.829123e-01
## TIF_BIN              -2.167580e-01
## MALE1                 1.505128e-01
## MARRIED1             -1.080955e+00
## LIC_REVOKED1          1.566145e+00
## CAR_RED1              .           
## PRIVATE_USE1         -1.794435e+00
## SINGLE_PARENT1        8.514047e-01
## URBAN1                5.129086e+00
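
The ranking of predictors by coefficient size can be checked directly from the printed values. The sketch below copies a handful of the coefficients from the output above into a named vector (the vector name coefs is introduced here) and sorts them by absolute value.

```r
# Selected coefficients copied from the printed output
coefs <- c(
  "URBAN1"             =  5.129086,
  "JOBDoctor"          = -2.371073,
  "JOBManager"         = -2.107991,
  "CAR_TYPESports Car" =  1.940375,
  "PRIVATE_USE1"       = -1.794435,
  "LIC_REVOKED1"       =  1.566145,
  "MARRIED1"           = -1.080955
)

# Rank by absolute size and compare the two largest
ranked        <- coefs[order(abs(coefs), decreasing = TRUE)]
ratio_top_two <- unname(abs(ranked[1]) / abs(ranked[2]))

print(ranked)
print(round(ratio_top_two, 2))
```

The comparison is most meaningful among the 0/1 dummy variables, which share a scale; coefficients of continuous predictors such as INCOME depend on the units of the variable.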

As mentioned earlier, the dataset has high correlation between some predictor variables. Lasso regression addresses this by tending to keep one variable from a group of correlated predictors and shrinking the coefficients of the others toward zero (as can be seen in the plot of coefficients).

Model Performance

The lasso model using coefficients extracted at lambda.1se was used to predict the 1,633 validation cases, and the predicted TARGET_AMT was compared to the actual cost of a car crash. The predicted costs include negative numbers that are effectively 0, so we selected a threshold cost and assigned 0 to all amounts below that threshold.

In the training data, 337.50 was the lowest crash cost included in the dataset. We used 100 as the threshold and assume that all predicted costs below 100 dollars are effectively 0.
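
The thresholding step can be sketched in base R as below. The predictions in pred_log are hypothetical, and the back-transformation assumes the response was modeled as log(TARGET_AMT + 0.01), which is an inference from the transformed summary in section 2.3 rather than something the report states.

```r
# Hypothetical log-scale predictions
pred_log <- c(-4.2, 3.1, 6.9, 8.4)

# Undo the assumed log(x + 0.01) transform to get dollar amounts
pred_amt <- exp(pred_log) - 0.01

# Treat every predicted cost below 100 dollars as no cost
pred_amt[pred_amt < 100] <- 0

print(round(pred_amt, 2))
```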

Using the yardstick package to measure model performance, we obtain mape = 1301, smape = 199, mase = 16.5, mpe = -83.5 and rmse = 756.

## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard      1301. 
## 2 smape   standard       199. 
## 3 mase    standard        16.5
## 4 mpe     standard       -83.5
## 5 rmse    standard       756.
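
To make these metrics concrete, the sketch below computes rmse and mape by hand in base R on four hypothetical observations, using the standard definitions. It also shows why percentage errors are fragile for this response: any observation with a true cost of zero puts a zero in the denominator.

```r
# Hypothetical true and predicted costs; two of the true costs are zero
truth    <- c(0, 1036, 4500, 0)
estimate <- c(0, 800, 3000, 250)

# Root mean squared error
rmse <- sqrt(mean((truth - estimate)^2))

# Mean absolute percentage error: 0/0 gives NaN and 250/0 gives Inf
mape_all <- mean(abs((truth - estimate) / truth)) * 100

# Same metric restricted to observations with a positive true cost
pos      <- truth > 0
mape_pos <- mean(abs((truth[pos] - estimate[pos]) / truth[pos])) * 100

print(c(rmse = rmse, mape_all = mape_all, mape_pos = mape_pos))
```

Scale-dependent metrics such as rmse are unaffected by zeros in the truth, which makes them easier to compare across models here.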

Analyzing a scatter plot of the prediction errors vs measured costs and a comparative histogram of the predicted and measured costs highlights a shortcoming of the lasso model. The model consistently predicts crash costs lower than the actual measured crash costs. The gap is more pronounced when looking at predicted and ??????

Model Assumptions

To reduce multicollinearity we can use regularization, which keeps all the features but reduces the magnitude of the model coefficients. This is a good solution when each predictor contributes to predicting the dependent variable.

The Standardized Residuals plot shows increasing variance at higher values of the response variable.

The lasso regression mitigates the multicollinearity issue by keeping one variable from a group of correlated predictors while shrinking the rest to (nearly) zero.

3.2 Model 2 - Stepwise Feature Selection

Model Performance

Adjusted R2: 0.22502
term estimate p.value
URBAN1 3.741 0.000
LIC_REVOKED1 1.719 0.000
MVR_PTS 0.256 0.000
PRIVATE_USE1 -1.516 0.000
TIF -0.366 0.000
CAR_TYPESports Car 1.649 0.000
KIDSDRIV 0.869 0.000
JOBManager -1.831 0.000
TRAVTIME 0.069 0.000
CAR_TYPESUV 0.991 0.000
BLUEBOOK -0.009 0.000
OLDCLAIM 0.060 0.000
SINGLE_PARENT1 1.137 0.000
MARRIED1 -0.830 0.000
CAR_TYPEPickup 0.789 0.000
CAR_TYPEVan 0.947 0.000
INCOME 0.000 0.001
JOBDoctor -1.779 0.001
HOME_VAL_BIN -0.575 0.001
EDUCATIONBachelors -0.702 0.003
JOBUnknown -0.999 0.012
HOME_VAL 0.000 0.034
CAR_TYPEPanel Truck 0.621 0.051
JOBLawyer -0.653 0.088
(Intercept) -0.880 0.109
EDUCATIONMasters -0.518 0.123
YOJ -0.027 0.133
JOBStudent -0.381 0.195
JOBProfessional -0.278 0.289
JOBHome Maker -0.328 0.323
EDUCATIONPhD -0.304 0.457
EDUCATIONHigh School 0.065 0.757
JOBClerical 0.030 0.900

The resulting model retains 32 terms plus the intercept, slightly fewer than the 35 non-zero coefficients of Model 1 at lambda.min. Of these, 22 are statistically significant at the 0.05 level; URBAN1, JOBManager and JOBDoctor have the largest significant estimates in absolute value.

The Adjusted R-Squared is better than Model 1 but still low (0.225), meaning this model explains only about 22.5% of the total variance in the response variable TARGET_AMT. However, an examination of the residuals indicates most of the key assumptions for linear regression are met - the Residuals vs Fitted plot shows a more constant variability of the residuals, and the Q-Q plot indicates a greater level of normality.

The summary table includes the estimate transformed to original scale for easier interpretation. In this case, the ‘base’ TARGET_AMT would be estimated at $3,434.76 with an increase of 1% per dollar of BLUEBOOK value, a 1.07% increase if the driver were male, and a 0.9% decrease if the driver were married.

Model Assumptions

3.3 Model 3 -

Model Performance

3.4 Model selection

4. Binary Logistic Regression

We’ll use Binary Logistic Regression to classify our response variable TARGET_FLAG, the probability of a car crash for a given observation.

4.1 Model 1 - Lasso

Lasso Regression may be a good candidate for this dataset, since we are dealing with a large number of complex variables. Lasso helps identify the most important variables and reduces the model complexity.

The cv.glmnet() function was also used to fit the logistic regression model. As with the linear regression model, k-fold cross-validation was performed with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “class” - The type.measure is set to class to minimize the mis-classification errors of the model since the accurate classification of the validation data set is the desired outcome.
  • nfolds = 10 - Given the size of the training dataset, we opted for 10-fold cross-validation as a default.
  • family = binomial - For Logistic Regression, the family attribute of the function is set to binomial.
  • link = logit - For this model, we choose the default link function for a logistic model.
  • alpha = 1 - The alpha value of 1 sets the variable shrinkage method to lasso.
  • weights = a weight of 0.2638 / n for observations with a TARGET_FLAG of 0 and 0.7362 / n for observations with a TARGET_FLAG of 1.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the predictor variables around a mean of zero and a standard deviation of one before modeling.

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

  • The coefficients extracted using lambda.min minimize the mean cross-validated error. The resulting model includes 35 non-zero coefficients and has an AICc of -1618.664.
  • The coefficients extracted using lambda.1se produce the most regularized model whose cross-validated error is within one standard error of the minimum. The resulting model includes 28 non-zero coefficients and has an AICc of -1576.707.

The coefficients extracted using lambda.min results in the lowest AIC and highest performance model.

## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.001364    48  0.9064 0.006829      35
## 1se 0.006043    32  0.9129 0.005470      28

## $deviance
## lambda.min 
##  0.8952216 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.min 
##  0.2052696 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.8128385
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.min 
##  0.2899812 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##  0.5863108 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1618.664
## 
## $BIC
## [1] -1381.617
## $AICc
## [1] -1576.707
## 
## $BIC
## [1] -1387.009

A closer look at the 35 non-zero coefficients (excluding the intercept) at the selected lambda value, lambda.min, shows that the URBANICITY Highly Urban/ Urban predictor variable has the largest impact on the prediction of a car crash, with a coefficient nearly three times the magnitude of the next-largest. Reviewing five of the strongest predictors of the likelihood of an accident:

  • URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the likelihood of a crash
  • CAR_USE Private - using a car for private activities reduces the likelihood of a crash
  • JOB Manager - being a manager reduces the likelihood of a crash
  • REVOKED Yes - a history of having a revoked license increases the likelihood of a crash
  • JOB Doctor - being a doctor reduces the likelihood of a crash

The HOME_VAL, JOB Home Maker, CAR_AGE, CAR_AGE_BIN and CAR_RED coefficients drop out of the model, and several others, including INCOME, JOB Professional and OLDCLAIM, are shrunk substantially.

## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.503148e+00
## KIDSDRIV              4.104985e-01
## AGE                  -2.985770e-03
## HOMEKIDS              1.467625e-02
## YOJ                  -5.603125e-03
## INCOME               -3.789459e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -3.280895e-01
## EDUCATIONHigh School  1.985406e-02
## EDUCATIONMasters     -3.554292e-01
## EDUCATIONPhD         -1.740929e-01
## JOBClerical           9.929721e-02
## JOBDoctor            -7.568191e-01
## JOBHome Maker         .           
## JOBLawyer            -5.063537e-02
## JOBManager           -7.474078e-01
## JOBProfessional      -2.970786e-04
## JOBStudent           -6.706470e-02
## JOBUnknown           -1.378131e-01
## TRAVTIME              3.655349e-02
## BLUEBOOK             -4.750919e-03
## TIF                  -6.142600e-02
## CAR_TYPEPanel Truck   3.134704e-01
## CAR_TYPEPickup        3.633557e-01
## CAR_TYPESports Car    8.355260e-01
## CAR_TYPESUV           4.954207e-01
## CAR_TYPEVan           4.529227e-01
## OLDCLAIM              2.156472e-02
## CLM_FREQ              5.234470e-02
## MVR_PTS               1.014769e-01
## CAR_AGE               .           
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.209622e-01
## TIF_BIN              -1.325964e-01
## MALE1                 2.502728e-02
## MARRIED1             -4.244342e-01
## LIC_REVOKED1          7.337489e-01
## CAR_RED1              .           
## PRIVATE_USE1         -7.770794e-01
## SINGLE_PARENT1        4.101707e-01
## URBAN1                2.218263e+00
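
Because these are logistic regression coefficients, they are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for a few coefficients copied from the output above (the vector names are introduced here for illustration).

```r
# Selected log-odds coefficients copied from the printed output
logit_coefs <- c(
  URBAN1       =  2.218263,
  LIC_REVOKED1 =  0.7337489,
  PRIVATE_USE1 = -0.7770794,
  JOBManager   = -0.7474078
)

# Odds ratio: multiplicative change in the odds of a crash
odds_ratios <- exp(logit_coefs)
print(round(odds_ratios, 2))
```

An odds ratio above 1 means the predictor raises the odds of a crash, holding the other predictors fixed, and a value below 1 means it lowers them. For URBAN1 the ratio is a little over 9.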

Model Performance

The coefficients extracted at the lambda.min value are used to predict the likelihood of an accident. The confusion matrix shows an accuracy of 78.0%.

##          True
## Predicted    0   1 Total
##     0     1108 266  1374
##     1       94 165   259
##     Total 1202 431  1633
## 
##  Percent Correct:  0.7795
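
The standard classification metrics follow directly from the four cells of the confusion matrix above. The sketch below recomputes them in base R, treating class 1 (crash) as the positive class.

```r
# Cells of the confusion matrix, with class 1 (crash) as positive
tn <- 1108  # predicted 0, true 0
fn <- 266   # predicted 0, true 1
fp <- 94    # predicted 1, true 0
tp <- 165   # predicted 1, true 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision, f1 = f1), 4))
```

The low sensitivity relative to specificity shows that the model misses most actual crashes, which is a common effect of class imbalance at the default cutoff.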

Checking Model Assumptions

Again we check the linear relationship between the independent variables and the logit of the target variable. Visually inspecting the results, there is a linear trend in the relationship, but there are deviations from the straight line in all variables. The lasso regression mitigates the multicollinearity issue by keeping one variable from a group of correlated predictors while shrinking the rest to (nearly) zero.

4.2 Model 2 - Stepwise Feature Selection

4.3 Model 3 -

4.4 Model selection

Model 3.1: Lasso Linear Regression

Model 4.1: Lasso Logistic Regression

##        F1 
## 0.4782609

Regression

Model              mape      smape     mase     mpe       rmse      AIC
M3.1:Lasso Linear  1301.105  199.0998  16.4806  -83.4733  756.1637  65.8296

Logistic

Model                Accuracy  Classification error rate  F1      Deviance  R2  Sensitivity  Specificity  Precision  AIC
M4.1:Lasso Logistic  0.7795    0.2205                     0.4783  0.8978    NA  0.3828       0.9218       0.6371     -1618.664

5. Predictions

6. Conclusion

7. References

Appendix: R code